Abstract of JP2002109536

PROBLEM TO BE SOLVED: To provide a technique for speeding up the generation of
clustering data that represents the hierarchical clustering of a set of data
samples. SOLUTION: Clusters are taken in order of increasing size, and for each
one the closest other cluster is selected. The data samples are sorted and
indexed by absolute distance from a reference, and the closest data sample is
searched for within a restricted index range. When distances are compared, the
contributions of the individual components of each data sample are accumulated
in order of the interquartile range of those components.
[0081] x(1)=(0,1); x(2)=(7,5); x(3)=(3,7); x(4)=(5,1); x(5)=(2,0); x(6)=(8,6); x(7)=(7,6); x(8)=(2,2); x(9)=(9,8)

[0084]
[Table 1]

i:       1     2     3     4     5     6     7     8     9
CND(i):  2.24  2     2     3.61  4.12  1     1     1     2.24
CNS(i):  2     3     2     3     7     7     8     7     8
NXD(i):  2.24  2     2     3.61  4.47  1     1     1     2.24
NXS(i):  2     3     2     3     6     7     8     7     8


[0086] 

source(2)=1; dest(2)=2; height(2)=2.24; join(2)=1

[0087]

source(3)=3; dest(3)=2; height(3)=2; join(3)=1


[0088]

source(4)=4; dest(4)=3; height(4)=3.61; join(4)=1

[0089]

source(7)=5; dest(7)=7; height(7)=4.12; join(7)=5


[0090]

source(6)=6; dest(6)=7; height(6)=1; join(6)=5
[0091]

source(8)=8; dest(8)=7; height(8)=1; join(8)=5

[0092]

source(9)=9; dest(9)=8; height(9)=2.24; join(9)=5

[0098] 

source(5)=4; dest(5)=6; height(5)=4.47; join(5)=1

[0099]
[Table 2]

Index i  source(i)  dest(i)  height(i)  join(i)
2        1          2        2.24       1
3        3          2        2          1
4        4          3        3.61       1
5        4          6        4.47       1
6        6          7        1          5
7        5          7        4.12       5
8        8          7        1          5
9        9          8        2.24       5


1 Title of Invention

Data Clustering Methods and Applications 

2 Claims 

1. A hierarchical data clustering method, including the following steps:

a. receiving as input a set of data samples; 

b. recording an initial cluster allocation of said data samples; 

c. for each cluster of said data samples, determining the most similar cluster to 
that cluster and recording the dissimilarity thereto, according to a predefined 
dissimilarity function; 

d. recording the identities of the data samples from which the dissimilarity was 
determined; 

e. recording the most similar cluster and the current cluster as a single cluster;

f. repeating steps c to e until a predetermined degree of clustering is reached; 
and 

g. providing as output the recorded dissimilarities and associated data sample 
identities; 

wherein at step c, the clusters are taken in order of increasing size. 

2. A method according to claim 1, wherein at step c, the cluster having the most
similar data sample relative to any of the data samples within the current cluster is 
determined as the most similar cluster. 

3. A hierarchical data clustering method, including the following steps: 

a. receiving as input a set of data samples;

b. indexing said set of data samples substantially in order of absolute distance 
from a common reference, according to a predefined absolute distance 
metric; 

c. recording an initial cluster allocation of said data samples; 

d. for each cluster of said data samples, determining the closest cluster to the 
current cluster and recording the distance thereto, according to a predefined 
intersample distance metric; 
e. recording the identities of the data samples from which the distance was
determined; 

f. recording the closest cluster and the current cluster as a single cluster;

g. repeating steps d to f until a predetermined degree of clustering is reached; 
and 

h. providing as output the recorded distances and data sample identities; 
wherein step d includes selecting for distance comparison with each data 
sample within the current cluster only a subset of the data samples outside 
the current cluster within an index range between a higher index having an 
index value higher than that of the current data sample and a lower index 
having an index value lower than that of the current data sample. 

4. A method according to claim 3, wherein step f includes, if the data samples of the
closest cluster and the current cluster are not adjacent in index, reindexing at least 
some of the data samples so that the data samples of the closest cluster and the 
current cluster are adjacent. 

5. A method according to claim 3 or claim 4, wherein the higher and lower indices are 
determined such that the smaller of the difference between the absolute distance of 
the data sample of the lower index and that of the current data sample, and the 
difference between the absolute distance of the data sample of the higher index and 
that of the current data sample, is greater than the minimum intersample distance 
between the current data sample and any of the data samples within the index range 
and not in the current test cluster. 

6. A method according to claim 5, wherein the higher and lower indices are
determined by successively reducing the lower index or increasing the higher index 
according to whether the difference between the absolute distance of the data 
sample of the lower index and that of the current data sample is respectively less 
than or greater than the difference between the absolute distance of the data sample 
of the higher index and that of the current data sample, until the smaller of the 
differences is greater than the minimum intersample distance between the current
data sample and any of the data samples within the index range and not in the 
current test cluster. 

7. A method according to any one of claims 3 to 6, wherein the absolute distance 
metric is the difference in component in the dimension of the data samples having 
the greatest variation. 

8. A method according to claim 7, wherein the variation is determined as the range
within which a predetermined fraction of the data samples fall. 

9. A hierarchical data clustering method including the following steps:

a. receiving as input a set of data samples each having a plurality of 
dimensions; 

b. determining for each of said dimensions a measure of variation of the data 
samples in that dimension; 

c. sorting the dimensions of the data samples according to their measures of 
variation; 

d. setting initially each data sample as belonging to its own cluster;

e. taking each cluster in turn, determining the closest data sample to any
sample in that cluster and not already forming part of that cluster;

f. merging the cluster of the closest data sample with the current cluster; and

g. repeating steps e) and f) until a desired degree of clustering has been
achieved; 

wherein the measure of variation is the range of a predetermined fraction of the data 
samples excluding the largest and smallest values in that dimension. 

10. A hierarchical data clustering method, including the following steps: 

a. receiving as input a set of data samples; 

b. recording an initial cluster allocation of said data samples; 
c. for each cluster of said data samples, determining the most similar cluster to
the current cluster and recording the dissimilarity thereof, according to a 
predefined dissimilarity function of a plurality of dimensions of the data 
samples; 

d. recording the identities of the data samples from which the dissimilarity was 
determined; 

e. recording the most similar cluster and the current cluster as a single cluster; 

f. repeating steps c to e until a predetermined degree of clustering is reached; 
and 

g. providing as output the recorded dissimilarities and associated data sample 
identities; 

wherein step c includes, for each dissimilarity calculation, taking the 
component of the distance measurement in each dimension in order of 
decreasing variation of data samples within each dimension, calculating a 
cumulative dissimilarity value, and terminating the dissimilarity calculation 
if the cumulative dissimilarity value exceeds a comparative dissimilarity 
value. 



11. A data compression method, including: 

a. performing a method according to any preceding claim; 

b. generating a compressed data set based on the set of data samples and the 
output of the method of step a; and 

c. outputting said compressed data set.

12. A method according to claim 11, including storing said compressed data set.

13. A method according to claim 11, including transmitting said compressed data set.



14. A feature extraction method, including:

a. performing a method according to any of claims 1 to 10; and

b. indicating associations between ones of said data samples within the same
cluster on the basis of the output of the method of step a.

15. A method according to claim 14, wherein step b includes comparing properties of at 
least one of the data samples with predetermined classification data according to the 
clustering properties of the at least one data samples, and outputting a classification 
indication of the data samples within the same cluster on the basis of the
comparison. 

16. An unmixing method, including: 

a. performing a method according to any one of claims 1 to 10; 

b. determining at least two characteristic properties of ones of said data 
samples on the basis of the clustering properties of said data samples 
determined by step a; and 

c. determining a mixing proportion of the characteristic properties for at least 
one of the data samples. 

17. A method according to claim 16, including determining at least one boundary 
region between ones of the data samples having one of the characteristic properties; 
wherein in step c, the at least one data samples are within the boundary region. 

18. A method according to claim 16, wherein the boundary region is determined by 
spatial or temporal edge detection. 



19. A method according to claim 17 or 18, including indicating as an anomaly any of
the data samples within the boundary region of which the value is determined not to
consist of a mixing proportion of the characteristic properties.



20. A method of data selection, including:

a. performing a method according to any of claims 1 to 10; and

b. selecting for further processing a subset of said data samples on the basis of
their clustering properties as determined at step a. 

21. A method according to claim 20, wherein said subset is selected to comprise a 
cluster in accordance with predefined clustering criteria within said cluster. 

22. A method according to claim 20, wherein said subset is selected to comprise a cluster
in accordance with predefined clustering criteria relative to other clusters. 

23. A method according to claim 20, including pre-selecting at least one of said data
samples, wherein the subset is selected to comprise a cluster including the or each 
pre-selected data sample. 

24. A method according to claim 20, including pre-selecting at least one of said data 
samples, wherein the subset is selected to comprise a cluster excluding the or each 
pre-selected data sample. 

25. A method of generating a network design, including the steps of: 

a. performing a method as claimed in any one of claims 1 to 10, wherein the
data samples represent nodes of a network; and 

b. generating a representation of interconnections between the nodes of the 
network in accordance with a minimum spanning tree defined by the output 
of step a. 

26. A method of network construction, including the steps of: 

a. performing a method as claimed in any one of claims 1 to 10, wherein the 
data samples represent nodes of a network; and 

b. creating interconnections between the nodes in accordance with a minimum 
spanning tree defined by the output of step a. 
27. A method of classifying a test sample relative to a cluster comprising a plurality of
data samples, including the steps of: 

a. determining the most similar data sample of the cluster to the test sample; 

b. calculating a value associated with the test sample and the cluster, 
dependent on the dissimilarity of the test sample to the most similar data 
sample and the dissimilarity of the most similar data sample to any other 
data sample within the cluster; and 

c. performing further processing steps dependent on the calculated value for 
the cluster; 

wherein at step b, the value is calculated as a function of the dissimilarity of 
the test sample to the most similar data sample and of the dissimilarity of 
the test sample to another data sample most similar to the most similar data 
sample within the cluster. 

28. A method of classifying a test sample relative to a cluster comprising at least three 
data samples, including the steps of: 

a. calculating a value associated with the test sample and the cluster, 
dependent on the dissimilarity between pairs of data samples within the 
cluster; and

b. performing further processing steps dependent on the calculated value for 
the cluster; 

wherein at step a, the value is calculated as a function of a test sample 
dissimilarity of the test sample to the most similar data sample within the 
cluster, unless the test sample dissimilarity is less than the dissimilarity of 
an edge in a minimum spanning tree which has the greatest dissimilarity less 
than an edge connected to the most similar data sample. 



29. A method according to claim 27 or 28, including calculating a value associated
with the test sample relative to each of one or more further clusters;
wherein said further processing steps are performed on the basis of a comparison
between the values calculated for each of the clusters.

30. A pattern recognition method, including: 

a. receiving a test sample; and 

b. performing a method according to any one of claims 27 to 29. 

31. A pattern recognition method, including: 

a. receiving a test sample; 

b. performing a method according to any one of claims 1 to 10; and

c. performing a method according to any one of claims 27 to 29. 

32. A method according to any preceding claim, wherein the data samples are samples 
of physical properties. 

33. A computer program arranged to perform a method according to any preceding 
claim when executed by a suitably arranged processor. 

34. A carrier carrying a computer program according to claim 33. 

35. Apparatus arranged to perform a method according to any one of claims 1 to 32. 

36. Apparatus comprising a data input, a pre-processing stage, a cluster processor 
arranged to perform a method according to any of claims 1 to 10, a post-processing 
stage and a data output. 



3 Detailed Description of the Invention

Field of the Invention

The present invention relates in one aspect to a hierarchical data clustering method 
and to processes and applications including that method. In another aspect, the present 
invention relates to a method of classifying a test sample relative to one or more clusters. 
The present invention relates further to processes and applications involving such methods. 

Background of the Invention 

Hierarchical cluster analysis involves the classification of a set of data samples into 
a cluster structure based on the similarity between the samples, without imposing any 
predefined grouping. There is a large body of literature on methods and applications of 
cluster analysis, examples of which are: 

'Data Clustering: A Review' by Jain, A. K., Murty, M. N. and Flynn, P. J., ACM
Computing Surveys No. 3, vol. 31, p. 264;

'Classification', by Gordon, A. D., Chapter 3, published Chapman & Hall, 1981.

Single-link cluster analysis is one type of cluster analysis which involves finding 
the 'minimum spanning tree' of a set of data samples. The 'minimum spanning tree' is a
set of lines or 'edges' which join pairs of data samples such that the total length or 'weight'
of the edges is a minimum; see for example: 

'Introduction to Algorithms', by Cormen, T. H., Leiserson, C. E., and Rivest, R. L.,
Chapter 24, published MIT Press 1990;

'Algorithms', by Sedgewick, R., Chapter 31, Second edition 1988, published by
Addison Wesley.

In practical applications, the usefulness of any cluster analysis technique depends
on its efficiency in terms of speed and storage requirements, which will be functions of the
number of data samples N, the number of dimensions D of each sample, and the structure
of the data samples. In the worst case, hierarchical clustering algorithms have time
requirements of the order of N^2 and can therefore be impractical for large data sets.
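The minimum spanning tree and the N^2 worst case mentioned above can be illustrated with a short sketch. It assumes Euclidean distance and uses Prim's algorithm, neither of which is prescribed by the text; the function name `minimum_spanning_tree` is hypothetical and chosen only for this example.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph of `points`.

    Returns a list of edges (i, j, weight). Every pair of samples may be
    compared once, which is the order-N^2 cost noted in the text.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link from the tree to each sample
    best_from = [-1] * n         # tree sample at the other end of that link
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the sample outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u, best_dist[u]))
        # Update the cheapest links of the remaining samples.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges
```

Sorting the returned edges by weight gives the order in which a single-link hierarchy would merge clusters.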

The paper 'An Efficient Interactive Agglomerative Hierarchical Clustering 
Algorithm for Hyperspectral Image Processing' by the present inventor, published in the 
Proceedings of the SPIE Conference on Imaging Spectrometry, San Diego, California, July
1998, SPIE Vol. 3438, pp. 210 to 221, describes clustering algorithms which involve
indexing data points so that nearby points have nearby indices and searching in a restricted 
subspace so as to reduce the number of comparisons which need to be made between pairs 
of points. However, the search is made only in one direction and does not achieve accurate 
single-link clustering. 

European patent publication EP 913 780 A discloses a data clustering method in 
which the total number of distance calculations is reduced by eliminating data samples 
unlikely to be nearest to the sample under consideration before selecting the nearest. 

The paper 'A Spectral Unmixing Algorithm for Distributed Endmembers with
Applications to BioMedical Imaging', Proceedings of SPIE, Vol. 3438, by the inventor of
the present invention, discloses a method of calculating a likelihood value of a test point 
belonging to a set of data points, based on a hierarchical clustering of the data points. 

Summary of the Invention 

According to one aspect of the present invention, there is provided a data clustering 
method including the following steps: 

a. Receiving as input a set of data samples; 

b. Setting initially each data sample as belonging to its own cluster;

c. Taking each cluster in turn, determining the closest data sample to any sample
in that cluster and not already forming part of that cluster; 

d. Merging the cluster of the closest data sample with the current cluster; and 

e. Repeating steps c and d until a desired degree of clustering has been achieved; 
wherein at step c the clusters are taken in order of increasing cluster size. 

An advantage of this method is that the number of distance or dissimilarity 
measurements which need to be calculated is greatly reduced, because there are generally 
fewer distances between samples in the smallest current cluster and another cluster of 
larger size than between samples in clusters both of larger size than the current smallest 
cluster. 
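Steps a to e, with clusters taken in order of increasing size, might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: Euclidean distance is assumed, ties in cluster size are broken by cluster label, and the name `smallest_first_single_link` and its `target_clusters` parameter are hypothetical.

```python
import math

def smallest_first_single_link(points, target_clusters=1):
    """Repeatedly merge the smallest cluster with the cluster of its nearest outside sample.

    Returns the merges as (sample_in_cluster, nearest_outside_sample, distance).
    """
    label = list(range(len(points)))                 # cluster label of each sample
    members = {i: [i] for i in range(len(points))}   # samples belonging to each cluster
    merges = []
    while len(members) > target_clusters:
        # Step c: take the smallest current cluster (ties broken by label).
        current = min(members, key=lambda c: (len(members[c]), c))
        best = (math.inf, -1, -1)
        for a in members[current]:
            for b in range(len(points)):
                if label[b] != current:
                    d = math.dist(points[a], points[b])
                    if d < best[0]:
                        best = (d, a, b)
        d, a, b = best
        # Step d: merge the cluster of the closest sample into the current one.
        other = label[b]
        for s in members[other]:
            label[s] = current
        members[current].extend(members.pop(other))
        merges.append((a, b, d))
    return merges
```

Because the inner loops run over the members of the current cluster, choosing the smallest cluster keeps the number of distance evaluations per merge low, which is the advantage the paragraph above describes.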

According to another aspect of the present invention, there is provided a data 
clustering method including the following steps: 
a. Receiving as input a set of data samples;

b. Setting initially each data sample as belonging to its own cluster;

c. Taking each cluster in turn, determining the closest data sample to any sample 
in that cluster and not already forming part of that cluster; 

d. Merging the cluster of the closest data sample with the current cluster; and 

e. Repeating steps c and d until a desired degree of clustering has been achieved; 
wherein at step c, the closest data sample is determined by searching over a 
restricted subspace defined by an index range of the data samples indexed 
according to distance from a reference. 

An advantage of this method is that, by indexing the data samples according to 
distance from a reference, there will be a maximum and minimum index relative to a 
sample under consideration within which the nearest sample must be contained and the 
number of samples which need to be compared to the sample under consideration is greatly 
reduced, without compromising the accuracy of the clustering. 
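The restricted index range can be sketched as below, assuming Euclidean distance for both the absolute and the intersample metric. The search widens the range on whichever side has the smaller difference in absolute distance and stops once that difference exceeds the best distance found; by the triangle inequality no sample further out in the index can then be closer. The function names and the `same_cluster` parameter are hypothetical.

```python
import math

def index_by_reference(points, reference):
    """Sort samples by absolute distance from `reference`; return (ordered, radii)."""
    ordered = sorted(points, key=lambda p: math.dist(p, reference))
    radii = [math.dist(p, reference) for p in ordered]
    return ordered, radii

def nearest_in_index_range(ordered, radii, i, same_cluster=frozenset()):
    """Find the sample nearest to ordered[i], skipping indices in `same_cluster`.

    Since |radii[j] - radii[i]| <= dist(ordered[i], ordered[j]), the scan can
    stop as soon as the smaller radius gap exceeds the best distance so far.
    Returns (best_distance, best_index, number_of_distance_evaluations).
    """
    n = len(ordered)
    lo, hi = i - 1, i + 1
    best_d, best_j, evaluated = math.inf, -1, 0
    while lo >= 0 or hi < n:
        gap_lo = radii[i] - radii[lo] if lo >= 0 else math.inf
        gap_hi = radii[hi] - radii[i] if hi < n else math.inf
        if min(gap_lo, gap_hi) > best_d:
            break                      # nothing outside the range can be closer
        if gap_lo <= gap_hi:
            j, lo = lo, lo - 1         # extend the range downwards
        else:
            j, hi = hi, hi + 1         # extend the range upwards
        if j in same_cluster:
            continue
        d = math.dist(ordered[i], ordered[j])
        evaluated += 1
        if d < best_d:
            best_d, best_j = d, j
    return best_d, best_j, evaluated
```

The third return value makes it easy to compare the number of distance evaluations against the N - 1 that an exhaustive search would need.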

According to another aspect of the present invention, there is provided a data 
clustering method including the following steps: 

a. Receiving as input a set of data samples each having a plurality of dimensions; 

b. Determining for each of said dimensions a measure of variation of the data 
samples in that dimension; 

c. Sorting the dimensions of the data samples according to their measures of 
variation; 

d. Setting initially each data sample as belonging to its own cluster; 

e. Taking each cluster in turn, determining the closest data sample to any sample 
in that cluster and not already forming part of that cluster; 

f. Merging the cluster of the closest data sample with the current cluster; and 

g. Repeating steps e and f until a desired degree of clustering has been achieved;
wherein the measure of variation is the range of a predetermined fraction of the data 
samples excluding the largest and smallest values in that dimension. 

An advantage of this method is that the dimensions which are most likely to be of 
significance in determining dissimilarities between samples are considered first and it is 
therefore often unnecessary when making comparisons between samples to consider all of
the dimensions. 

According to another aspect of the present invention, there is provided a method 
of classifying a test sample relative to a cluster comprising a plurality of data samples, 
including the steps of: 

a. determining the most similar data sample of the cluster to the test sample; 

b. calculating a value associated with the test sample and the cluster, 
dependent on the dissimilarity of the test sample to the most similar data 
sample and the dissimilarity of the most similar data sample to any other 
data sample within the cluster; and 

c. performing further processing steps dependent on the calculated value for 
the cluster; 

wherein at step b, the value is calculated as a function of the dissimilarity of 
the test sample to the most similar data sample and of the dissimilarity of 
the test sample to another data sample most similar to the most similar data 
sample within the cluster. 

An advantage of this method is that the value is calculated with reference to an
edge rather than an individual sample, and therefore varies more smoothly in
regions intermediate the samples joined by the edge.

According to another aspect of the present invention, there is provided a method 
of classifying a test sample relative to a cluster comprising at least three data samples, 
including the steps of: 

a. calculating a value associated with the test sample and the cluster, dependent on the 
dissimilarity between pairs of data samples within the cluster; and 

b. performing further processing steps dependent on the calculated value for the 
cluster; 

wherein at step a, the value is calculated as a function of a test sample dissimilarity
of the test sample to the most similar data sample within the cluster, unless the test
sample dissimilarity is less than the dissimilarity of an edge in a minimum spanning
tree which has the greatest dissimilarity less than an edge connected to the most
similar data sample.

An advantage of this method is that the test sample is given greater weight 
when close to the shortest edge than when close to the next shortest edge and therefore 
not as close to the tightest region of the cluster. 



Description of Specific Embodiments 

A method according to one embodiment of the invention is described below. An
array of N samples each of D dimensions is provided as input.

Step 1 - Rank Dimensions 

The interquartile range, in other words the range between the first and third 
quartiles and hence containing the middle 50% of the samples, is calculated for each of the 
D dimensions. The interquartile range is an advantageous measure in this case, because it is 
little affected by stray samples at the extremes of the range and therefore gives a good 
representation of the variation for the majority of the samples. The dimensions of the 
sample array are reordered in order of decreasing interquartile range, or a ranking order of 
the dimensions is stored in a dimension rank array. 
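
The ranking of Step 1 can be sketched as follows. This is a minimal illustration and not the implementation of the embodiment: the function names `interquartile_range` and `rank_dimensions` are hypothetical, and the 'inclusive' quartile convention of Python's standard `statistics.quantiles` is an assumed choice, since the description does not fix one. The data are the nine two-dimensional samples of the specific example given later in this description.

```python
from statistics import quantiles

def interquartile_range(values):
    """Range between the first and third quartiles of a list of numbers."""
    # Assumption: the 'inclusive' quartile convention; the text does not name one.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def rank_dimensions(samples):
    """Return the dimension indices in order of decreasing interquartile range."""
    n_dims = len(samples[0])
    iqr = [interquartile_range([s[d] for s in samples]) for d in range(n_dims)]
    # sorted() is stable, so dimensions with equal range keep their original order.
    return sorted(range(n_dims), key=lambda d: -iqr[d])

# The nine samples x(1)..x(9) of the specific example.
samples = [(0, 1), (7, 5), (3, 7), (5, 1), (2, 0), (8, 6), (7, 6), (2, 2), (9, 8)]
print(rank_dimensions(samples))
```

Under this convention both dimensions of the example have the same interquartile range, so the returned rank array leaves the dimensions in their original order.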

Step 2 - Reorder Data Samples 

The data samples are reordered within the array in one of two ways: 

2a) Radial Ordering: the samples are reordered in order of increasing distance from
an origin, which is selected for example to be the sample with the smallest component in
the dimension with the largest interquartile range.

2b) Linear Ordering: the samples are reordered in order of increasing value in a
selected dimension (preferably the dimension with the largest interquartile range).

In both cases, the original index is stored in a one-dimensional array so as to allow
identification of individual samples.
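
The two orderings of Step 2 may be illustrated by the following sketch, which assumes Euclidean distance and uses the nine samples of the specific example given later in this description. The function names `radial_order` and `linear_order` are hypothetical; each returns the list of original sample indices in the new order, which corresponds to the one-dimensional array of original indices mentioned above.

```python
import math

# The nine samples of the specific example, keyed by their original index.
samples = {1: (0, 1), 2: (7, 5), 3: (3, 7), 4: (5, 1), 5: (2, 0),
           6: (8, 6), 7: (7, 6), 8: (2, 2), 9: (9, 8)}

def radial_order(samples, origin):
    """Original indices sorted by increasing Euclidean distance from origin."""
    return sorted(samples, key=lambda i: math.dist(samples[i], origin))

def linear_order(samples, dim):
    """Original indices sorted by increasing value in the selected dimension."""
    return sorted(samples, key=lambda i: samples[i][dim])

# Origin: the sample with the smallest component in the chosen dimension (here x).
origin = min(samples.values(), key=lambda s: s[0])

print(radial_order(samples, origin))
print(linear_order(samples, 0))
```

Because Python's sort is stable, samples at equal distance keep their original relative order.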

Step 3 - Create Binary Tree Leaf Nodes 

Each sample is assigned to a corresponding leaf node of a binary tree so that the i-th
sample after reordering is assigned initially to the i-th leaf node of the binary tree, and the
assignment is stored in an array.

Step 4 - Create Cluster Labels 

An array of cluster labels is created indicating the cluster to which each sample 
belongs. Each sample is initially considered to belong to its own cluster and hence the 
cluster label is initially the index number of the sample. 

Step 5 - Record Cluster Size and Number

The size (number of samples) of each cluster and the number of clusters (initially
N) are recorded.

Step 6 - Record Nearest Distance 

The distance from each unmerged sample to the nearest sample in a different cluster 
is stored as a variable CND(i) (Current Nearest Distance). 

Step 7 - Record Nearest Sample 

The index of the nearest sample is stored as an integer CNS(i) (Current Nearest 
Sample). 

Step 8 - Record Merge Height 

The distance of each unmerged sample to the cluster to which it is merged is stored
as the 'merge height'. Initially, no merging has been done so that the distance is set as
infinity (i.e. a maximum value).

Step 9 - Record Inter-sample and Next Distances 

For each sample, the samples are found having the next highest and lowest leaf 
node indices not within the same cluster; these will be referred to as the 'next upper' and 
'next lower' samples NUS(i), NLS(i) respectively. For example, if the test sample is at leaf
node i, initially the next upper sample NUS(i) will be at leaf node i+1 and the next lower
sample NLS(i) at leaf node i-1, because each sample initially belongs only to its own
cluster.

For each sample and its next upper and lower samples, the 'absolute distance' from 
the origin is calculated. If radial ordering was used at step 2a), then the 'absolute distance' 
is the radial distance from the chosen origin. If linear ordering was used at step 2b), then 
the 'absolute distance' is the distance in the chosen direction from the chosen origin, which
is preferably the sample with the smallest component in the chosen direction. 

Next, the difference between the absolute distance of the sample and the absolute
distance of the next upper sample NUS(i), and the difference between the absolute distance
of the sample and the absolute distance of the next lower sample NLS(i) are calculated.
The smaller of these two differences is stored as the 'next distance' NXD(i) of the sample
and the sample index of the next upper or next lower sample NUS(i), NLS(i) which gave
the smaller of the two differences is stored as the 'next sample' NXS(i).
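
The reason the 'next distance' is useful is that the difference in absolute distance between two samples can never exceed the true distance between them, so it acts as a lower bound. The following sketch shows only this pruning idea for a single sample under radial ordering; it is not the full cluster-aware procedure of step 13, and the function name `nearest_by_index_range` is hypothetical. It assumes Euclidean distance and a list of points already sorted by distance from the origin.

```python
import math

def nearest_by_index_range(points, origin, i):
    """Nearest neighbour of points[i], where points is sorted by distance from origin.

    Returns (index, distance). Candidates are visited outwards from i in order of
    increasing difference in absolute distance, which is a lower bound on the
    true distance, so the search stops once that bound reaches the best found.
    """
    r = [math.dist(p, origin) for p in points]   # absolute distances
    best_j, best_d = None, math.inf
    lo, hi = i - 1, i + 1                        # next lower / next upper candidates
    while lo >= 0 or hi < len(points):
        gap_lo = r[i] - r[lo] if lo >= 0 else math.inf
        gap_hi = r[hi] - r[i] if hi < len(points) else math.inf
        if min(gap_lo, gap_hi) >= best_d:
            break                                # no remaining sample can be closer
        if gap_lo <= gap_hi:
            j, lo = lo, lo - 1
        else:
            j, hi = hi, hi + 1
        d = math.dist(points[i], points[j])      # full distance measured only now
        if d < best_d:
            best_j, best_d = j, d
    return best_j, best_d

# The example samples in radial order from the origin (0, 1), zero-based here.
points = [(0, 1), (2, 0), (2, 2), (5, 1), (3, 7), (7, 5), (7, 6), (8, 6), (9, 8)]
print(nearest_by_index_range(points, (0, 1), 4))
```

For the sample at (3, 7) the search measures four full distances rather than eight before the bound excludes the remaining samples.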

Step 10 - Set Test Cluster Size 

Test Cluster Size TCS is set initially to 1.

Step 11 - Set Current Test Sample and Cluster

The current test sample CTS is set initially to sample index 1, and the current test
cluster CTC is the cluster containing that sample (initially cluster 1).

Step 12 - Find Next Cluster of Test Cluster Size 

The clusters are examined in the order in which they appear in the binary tree to
find the next cluster having a size equal to the test cluster size TCS. This is done as
follows:

12a) If the current cluster size is equal to the test cluster size, then step 12 is 
complete. 

12b) Otherwise, jump to the sample at the leaf node immediately following the
current test cluster (the samples of a cluster are always grouped in consecutive leaf
nodes in the tree). Make the cluster containing this sample the current test cluster and
go to step 12a). If the last leaf node has already been reached, set the test cluster size
as the size of the smallest cluster present, as follows:

12ba) Increment the current test cluster size and set the 'current minimum 
test cluster size' to N (i.e. the maximum possible cluster size). 

12bb) Set the current test sample to be the sample at the first leaf node. 

12bc) If the size of the current test cluster is the same as the current test 
cluster size, then go to step 13. 

12bd) If the size of the current test cluster is less than the current minimum
test cluster size, update the current minimum test cluster size and store the current test
cluster as the minimum size test cluster.

12be) If there are no more samples in the tree after the current test cluster,
go to step 12bf). Otherwise, jump to the sample in the tree immediately following the
current test cluster, make this the current test sample, and go to step 12bc).

12bf) Make the current test cluster the minimum size test cluster.

Step 13 - Merge with Nearest Sample Not Contained in Cluster 

13a) The nearest sample to the current test cluster not itself contained in the test cluster is
found as follows:

13aa) Set Minimum Distance MinDist as infinity (i.e. a maximum value).

13ab) For each sample in the current test cluster, do the following:

13aba) If the current nearest sample CNS(i) is not a member of the current
test cluster CTC(i), and if the current nearest distance CND(i) is less than the
minimum distance MinDist, set the minimum distance MinDist to be the current
nearest distance CND(i), store the current sample index as the 'head' and the
current nearest sample CNS(i) as the 'tail'.

Otherwise, if the next distance NXD(i) is less than the minimum distance 
MinDist, update the current nearest sample CNS(i) and current nearest distance 
CND(i) by resetting the current nearest distance CND(i) to infinity and proceed as 
follows: 

13abaa) Find the next sample NXS(i) and the next distance NXD(i) 
for the current sample as in step 9. 

13abab) If the next distance NXD(i) is not less than the minimum
distance, return to step 13ab) with the next sample in the cluster as the
current sample. Otherwise, measure the distance MeasDist from the current
sample to the next sample NXS(i). If MeasDist is greater than the nearest
distance CND(i), then go to step 13abad). The comparison is performed by
measuring the component of the distance in each dimension in the order
determined at step 1, squaring the value of that component, and adding the
squared value to a sum of squared values. After each summing operation,
the sum is compared with the square of the nearest distance CND(i), and if
greater, then the operation proceeds to step 13abad) without summing any
more terms. This method of performing the comparison avoids unnecessary
calculations and gives improved speed, particularly if D is large.
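
The early-terminating comparison described in step 13abab) may be sketched as follows. The function name `partial_distance` and its return convention (the distance, or `None` once the limit is exceeded) are assumptions made for illustration; `dim_order` stands for the dimension ranking produced at step 1.

```python
import math

def partial_distance(a, b, limit, dim_order):
    """Euclidean distance from a to b, or None as soon as it is known to exceed limit.

    Squared components are accumulated in the order given by dim_order, and the
    running sum is compared with limit squared after every addition.
    """
    limit_sq = limit * limit
    total = 0.0
    for d in dim_order:
        diff = a[d] - b[d]
        total += diff * diff
        if total > limit_sq:
            return None        # remaining dimensions can only increase the sum
    return math.sqrt(total)

print(partial_distance((0, 0), (3, 4), 6.0, [0, 1]))          # within the limit
print(partial_distance((0, 0, 0), (10, 1, 1), 5.0, [0, 1, 2]))  # rejected early
```

Visiting the dimensions of greatest variation first makes it more likely that the limit is exceeded after only a few terms, which is the purpose of the ranking at step 1.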

13abac) If MeasDist is less than the nearest distance CND(i),
update the nearest sample CNS(i) and the nearest distance CND(i) to be the
next sample NXS(i) and MeasDist respectively. If furthermore MeasDist is
less than MinDist, update MinDist to be MeasDist, store the current sample
as the 'head' and the next sample NXS(i) as the 'tail'.

13abad) Update the next upper or next lower sample NUS(i),
NLS(i), according to whether the next sample NXS(i) was the next upper or
next lower sample. If the next sample NXS(i) was the next upper sample
NUS(i), then update the next upper sample NUS(i) to be the sample with the
next higher leaf node index following the current next upper sample NUS(i)
not in the test cluster. If the next sample NXS(i) was the next lower sample
NLS(i), then update the next lower sample NLS(i) to be the sample with the
next lower leaf node index before the current next lower sample NLS(i) not
in the current test cluster. Recalculate the next sample NXS(i) and next
distance NXD(i) taking into account the new next upper/lower sample
NUS(i)/NLS(i); then go to 13abab).

Once the last sample in the current test cluster has been processed, go to step 13b).

13b) At this stage the 'head' and 'tail' are the samples to be joined in the minimum spanning
tree, and the current test cluster which contains the head is merged with the cluster which
contains the tail, with the merge height set to be the minimum distance. The cluster having
the higher value label is added to the cluster having the lower value label. This is done as
follows:

13ba) The leaf positions of samples in the binary tree are rearranged such that the
smaller cluster and any samples between the smaller and the larger cluster are swapped in
leaf node position so that the smaller and larger clusters become adjacent. The sample
indices stored against each leaf node are updated to reflect the swap. The cluster having the
higher value label is added to the cluster having the lower value label by assigning the lower
cluster label to the samples of the higher cluster.

13bb) The head and tail sample indices, the minimum distance MinDist (which is
equal to the merging height) and the lower cluster label are stored in an array respectively
as source(i), dest(i), height(i) and join(i), where i is the higher cluster label.

13bc) The array element storing the size of the lower cluster is increased by the size 
of the higher cluster. 

13bd) The number of clusters is decremented. 

Step 14 - Repeat

If there is only one cluster left, end the procedure; otherwise go to step 12. 

The arrays source(i), dest(i) and height(i) are sufficient to define the minimum
spanning tree and binary tree of the samples. Join(i) provides redundant information which
nevertheless saves subsequent processing steps.

Specific Example 

An example of the above method will now be described with reference to Figures 3
to 5, in a simple example where the number of dimensions D is 2 and the number N of data
samples or 'patterns' x is 9. Figure 3 shows the values of the data samples as follows:

x(1)=(0,1); x(2)=(7,5); x(3)=(3,7); x(4)=(5,1); x(5)=(2,0); x(6)=(8,6); x(7)=(7,6);
x(8)=(2,2); x(9)=(9,8).

At step 1), there is no need to reorder the dimensions as the interquartile range is
the same for both x and y.

At step 2), the radial ordering method of step 2a) is used. The sample (0,1) is
chosen as the origin and the samples are reordered (1, 5, 8, 4, 3, 2, 7, 6, 9) and
re-indexed in their new order, shown in italics in Figure 3. If the linear ordering
method of step 2b) were used and the x dimension chosen, the order would be (1, 5,
8, 3, 4, 2, 7, 6, 9).



In steps 6, 7 and 9, the following values are obtained:

Sample Index:   1      2      3      4      5      6      7      8      9
CND(i):         2.24   2      2      3.61   4.12   1      1      1      2.24
CNS(i):         2      3      2      3      7      7      8      7      8
NXD(i):         2.24   2      2      3.61   4.47   1      1      1      2.24
NXS(i):         2      3      2      3      6      7      8      7      8

At step 12, sample 1 is selected as the current test sample and the test cluster label
is also set to 1. At step 13, sample 2 is joined to sample 1 as shown by edge a in
Figure 3, and cluster 2 is merged into cluster 1. The following information is
recorded:

source(2)=1; dest(2)=2; height(2)=2.24; join(2)=1

Next, sample 3 becomes the current test sample at step 12 as it belongs to the next
cluster. At step 13, sample 3 is joined to sample 2 as shown by edge b in Figure 3,
and cluster 3 is merged into cluster 1. The following information is recorded:

source(3)=3; dest(3)=2; height(3)=2; join(3)=1

Next, sample 4 becomes the current test sample at step 12. At step 13, sample 4 is
joined to sample 3 as shown by edge c in Figure 3, and cluster 4 is merged into
cluster 1. The following information is recorded:

source(4)=4; dest(4)=3; height(4)=3.61; join(4)=1

Next, sample 5 becomes the current test sample at step 12. At step 13, sample 5 is
joined to sample 7, as shown by edge d in Figure 3, and cluster 7 is merged into
cluster 5. This necessitates a swap in leaf node position between samples 6 and 7 in
step 13ba), so that the tree now appears as shown in Figure 3. The following
information is recorded:

source(7)=5; dest(7)=7; height(7)=4.12; join(7)=5

Next, sample 6 becomes the current test sample because it is at the next leaf node.
Sample 6 is joined to sample 7 as shown by edge e in Figure 3 and cluster 6 is
merged into cluster 5 (of which sample 7 is already a member). The following
information is recorded:

source(6)=6; dest(6)=7; height(6)=1; join(6)=5

Next, sample 8 becomes the current test sample and is joined to sample 7 as shown
by edge f. Cluster 8 is merged into cluster 5 and the following information is
recorded:

source(8)=8; dest(8)=7; height(8)=1; join(8)=5

Next, sample 9 becomes the current test sample and is joined to sample 8 as shown
by edge g. Cluster 9 is merged into cluster 5 and the following information is
recorded:

source(9)=9; dest(9)=8; height(9)=2.24; join(9)=5.

Now, the test cluster size increases to 4 in step 12, as this is the minimum size of
cluster present, and cluster 1 is taken as the current test cluster. Samples 1 to 4 are
taken in turn, in leaf node order.

With test sample 1, at step 13aba) the nearest sample CNS(1) is sample 2, which is
in the same cluster and the method proceeds to step 13abaa). As there is no next
lower sample, the next upper sample NUS(1) and the next sample NXS(1) are both
sample 5. The next distance NXD(1) = √45, as is the measured distance MeasDist.
This is less than MinDist and the nearest distance CND(1), so CND(1) = MinDist =
MeasDist and CNS(1)=5. At step 13abad) the next upper sample NUS(1) is
incremented to sample 6 and the method returns to step 13abab). The next distance
NXD(1) is the difference between the absolute distances of samples 1 and 6, which
is √65. This is not less than MinDist, so we return to step 13ab) for the next test
sample.

With test sample 2, the nearest sample CNS(2)=3 which is in the same cluster and
the method proceeds to step 13abaa). The next distance NXD(2) is the difference
between the absolute distances of samples 2 and 5, which is √45-√5. This is less
than the minimum distance MinDist, so MeasDist is calculated as the distance to
sample 5, which is √50. This is less than CND(2), so CNS(2) = 5 and CND(2) =
MeasDist. However, MeasDist is not less than MinDist, so the method proceeds to
step 13abad). NXS(2) becomes 6, and the method returns to step 13abab). NXD(2) is
calculated as √65-√5. This is less than MinDist, so MeasDist is calculated as √50.
This is equal to CND(2) and not less than MinDist, so the method proceeds to step
13abad), NXS(2) becomes 7, and the method returns to step 13abab). NXD(2) is
√74-√5. This is less than MinDist, so MeasDist is calculated as √61. This is not less than
MinDist or CND(2), so the method proceeds to step 13abad), NXS(2) becomes 8,
and the method returns to step 13abab). NXD(2) is √89-√5. This is not less than
MinDist, so the method returns to step 13ab) for the next test sample.

With test sample 3, NXS(3)=5 and NXD(3)=√45-√5. This is less than MinDist, so
MeasDist is calculated as √26. This is less than CND(3) and MinDist, so
CNS(3)=5, MinDist = CND(3) = MeasDist = √26. At step 13abad), NXS(3) is
updated to 6 and NXD(3)=√65-√5. This is not less than MinDist, so the method
returns to step 13ab) for the next test sample.

With test sample 4, CND(4) = √20, which is less than MinDist. Hence, MinDist
becomes CND(4), and step 13a) is complete as this is the last test sample in the
cluster.

At step 13b), sample 4 is the head and sample 6 is the tail. The corresponding edge
is shown as h in Figure 3. Cluster 5 is merged into cluster 1 and the following
information is recorded:

source(5)=4; dest(5)=6; height(5)=4.47; join(5)=1



As there is only one cluster left, the clustering method halts. The output of the 
clustering method comprises the array (source(i), dest(i), height(i), join(i)) as 
follows: 



Index i   source(i)   dest(i)   height(i)   join(i)
2         1           2         2.24        1
3         3           2         2           1
4         4           3         3.61        1
5         4           6         4.47        1
6         6           7         1           5
7         5           7         4.12        5
8         8           7         1           5
9         9           8         2.24        5



Technical Processes 

The method described above can be applied to any process or application involving 
a hierarchical clustering algorithm, preferably a single-link algorithm. However, the above 
method requires considerably fewer operations and therefore can be executed significantly 
faster, for a given platform, than known single-link clustering methods. The method can be 
applied to physical data samples, that is samples of physical quantities. The output of the 
method therefore represents an underlying physical structure of the physical quantities. 

Some of the known applications will be described below, and categorised into 
generic types of process. Figure 6 shows the general form of apparatus for carrying out 
these processes, comprising a data input I, a pre-processing stage PRE, a clustering 
processor CP, a post-processing stage POST and a data output O. These stages do not 
necessarily represent discrete physical components. The data input I may be a sensor or 
sensor array, or in the case of non-physical data, a data input device or network of such
devices. The input data may be stored on storage means prior to further processing. The 
pre-processing stage PRE may perform analog-to-digital conversion, if necessary, and may 
also restrict the dimensions of the data to those required for clustering. The clustering 
processor CP, which may be one or many physical processors, performs the clustering 
method and outputs the clustering data. The post-processing stage POST may partition the 
single hierarchical cluster into multiple clusters for subsequent processing, in accordance 
with the clustering structure, and may perform automatic classification of the clusters based 
on their clustering structure and/or the properties of the data samples within the cluster. 
The data output O may be a display, a printer, or data storage means, for example. 

Embodiments of the present invention include a program which performs a method 
in accordance with the invention when executed by a suitably arranged processor, such as 
the clustering processor CP. The program may be stored on a carrier, such as a removable 
or fixed disc, tape or other storage means, or transmitted and received on a carrier such as 
an electromagnetic signal. 

Compression 

Once a set of data samples has been classified into a hierarchical tree of clusters,
the data samples themselves can be replaced by a compressed data set which describes the
form of the clusters. In the case of lossy compression, the compressed data set represents
the general form of the clusters without specifying the individual data samples. For
example, the tree is separated into multiple clusters each having merging heights less than a 
predetermined value, and the data samples within each cluster are represented in the 
compressed data set by the coordinates of the centroid of the cluster. In the case of lossless 
compression, the tree is separated into multiple clusters each having merging heights less 
than a predetermined value, and individual samples within the cluster are represented by 
differential vectors from the centroid of the cluster; the differential vectors will have a 
smaller range than the absolute coordinates of the samples, and may therefore be 
represented using fewer bits. This technique is applicable to any type of data, whether
physical data such as image, audio, video or quantities not directly perceptible by humans,
or non-physical data such as economic data. The compressed data set may be stored on a
storage medium, giving more efficient storage, or transmitted over a communications link
or local bus, giving reduced bandwidth requirements.
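
The centroid and differential-vector representation may be sketched as follows. This is an illustration only: the function names `compress_cluster` and `decompress_cluster` are hypothetical, no bit-level encoding is shown, and the round trip is exact here only because the chosen coordinates are exactly representable in floating point. The cluster used is one of the two clusters that exist in the specific example immediately before the final merge.

```python
def compress_cluster(samples):
    """Represent a cluster as its centroid plus one differential vector per sample."""
    n, dims = len(samples), len(samples[0])
    centroid = tuple(sum(s[d] for s in samples) / n for d in range(dims))
    diffs = [tuple(s[d] - centroid[d] for d in range(dims)) for s in samples]
    return centroid, diffs

def decompress_cluster(centroid, diffs):
    """Recover the samples by adding each differential vector to the centroid."""
    return [tuple(c + x for c, x in zip(centroid, diff)) for diff in diffs]

cluster = [(7, 5), (7, 6), (8, 6), (9, 8)]
centroid, diffs = compress_cluster(cluster)
print(centroid)   # lossy form: keep only this
print(diffs)      # lossless form: keep these as well; their range is smaller
```

Keeping only the centroid corresponds to the lossy case, while keeping the differential vectors as well corresponds to the lossless case.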

Hence, apparatus may be provided for carrying out this process, in which the data
output O is a data store or data channel.

Segmentation and Feature Extraction 

Once a set of data samples has been classified into a hierarchical tree of clusters, 
the tree may be divided into separate clusters, for example by setting a merge height at 
which the tree is divided. Each cluster represents a different class of data and may be used 
for analysis by attributing a different property to each cluster. The membership of each data 
sample to its cluster may be indicated, for example by colour-coding the samples in a 
display according to their cluster. 
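
Dividing the tree at a chosen merge height may be sketched directly from the source(i), dest(i) and height(i) arrays. The sketch below uses a union-find structure, which is an assumed implementation choice rather than anything stated in this description, and the function name `cut_tree` is hypothetical. The edges are those recorded in the specific example.

```python
def cut_tree(n_samples, edges, max_height):
    """Group samples 1..n_samples, keeping only edges with height below max_height."""
    parent = list(range(n_samples + 1))       # index 0 unused; samples are 1-based

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    for source, dest, height in edges:
        if height < max_height:
            parent[find(source)] = find(dest)

    clusters = {}
    for s in range(1, n_samples + 1):
        clusters.setdefault(find(s), []).append(s)
    return sorted(clusters.values())

# (source, dest, height) triples recorded in the specific example.
edges = [(1, 2, 2.24), (3, 2, 2.0), (4, 3, 3.61), (4, 6, 4.47),
         (6, 7, 1.0), (5, 7, 4.12), (8, 7, 1.0), (9, 8, 2.24)]

print(cut_tree(9, edges, 3.0))
print(cut_tree(9, edges, 4.2))
```

Each returned group can then be given its own label or display colour.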

In the field of remote sensing, a similar technique may be used to display different 
object or terrain types. The display may be interpreted by a user or processed to provide 
automatic identification, for example by comparison with the shape or spectral properties 
of known object or terrain types. 

A similar technique may be applied in image segmentation, where an image is
partitioned into areas of similar colour or shade, for the purpose, for example, of converting
a colour image into a greyscale image or a bitmap image into a vector image. This
application is also an example of compression, in that the greyscale or vector image
requires fewer bits than the original image.

In some cases, there will be an overlap between the segments of a data sample set, 
and it is then desirable to estimate the proportion of each segment type present in the 
overlap area. For example, in an image there may be areas which represent a mixing of two 
main components of the object of the image, such as trees and grass in a remote sensing 
image. Pure areas which contain only one component are first identified by finding data 
samples which are tightly clustered, and the spectral properties of these pure components 
are determined. Edge detection is then performed to identify the boundaries between these 
pure areas. In these boundary areas, the proportion of each component is then determined 
by fitting a mixing model of the spectral properties of the pure areas to the spectral 
properties of the boundary areas. An example of this technique is described in the paper 'A
Spectral Unmixing Algorithm for Distributed Endmembers with Applications to
BioMedical Imaging' as referenced above.

If the properties of a data sample in the boundary areas cannot be fitted to within a 
given tolerance to the mixing model, the data sample is flagged as an anomaly and may be 
highlighted on a display. Anomaly detection is useful in medical imaging for the detection 
of small abnormalities, such as tumours, and in remote sensing for the detection of unusual 
objects or features. 

In some specific applications, the boundary between pure samples is determined by 
the spatial or temporal properties of the samples. However, the boundary may be defined 
purely with reference to the clustering properties of the samples in their dimension space, 
so that tight clusters within that space are considered to contain pure samples, and samples 
between those clusters in the dimension space are considered to be mixed samples. 

Hence, in apparatus which carries out this process, the post-processing step POST 
may generate data flags or labels associated with individual clusters, and the data output O 
provides an indication of the data flags or labels, such as a false colour or outline display of 
the data samples. 

Data Mining/Browsing 

In this technique, the cluster structure is used to select for inspection a subset of a 
large collection of data. In one case, one initial data sample is found and the other members 
of a cluster to which the initial data sample belongs are selected for inspection. One 
application of this technique is in the field of document searching and retrieval, such as 
web searching. 

In another case, clusters are selected which have the desired properties, such as 
tight clustering (e.g. large numbers of samples with low merge heights) and the members of 
the selected clusters are inspected. One application of this technique is in the field of data 
mining, in which tight clusters within a large database are selected and analysed so as to 
make inferences based on the members of the cluster. Alternatively, the desired property 
may be a very loose clustering, for example in the field of fraud detection where a data 
sample that is dissimilar to other samples may indicate fraudulent activity. 

Where the data samples are non-metric (i.e. they do not take the form of an array of
measurement values), a dissimilarity function must be chosen so as to represent 
numerically the difference between any two data samples. The dissimilarity function may 
be a function of the number of similar words appearing in two documents, for example. 
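
One possible dissimilarity function of this kind is sketched below. It uses the Jaccard distance between the sets of words in two documents, which is a substituted, well-known measure chosen for illustration; the description itself only requires some function of the number of similar words. The function name `word_dissimilarity` and the whitespace tokenisation are assumptions.

```python
def word_dissimilarity(doc_a, doc_b):
    """Jaccard distance between the word sets of two documents (0 = same words)."""
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    if not words_a and not words_b:
        return 0.0                     # two empty documents are treated as identical
    shared = len(words_a & words_b)
    return 1.0 - shared / len(words_a | words_b)

print(word_dissimilarity("fast clustering method", "slow clustering method"))
```

The value lies between 0 and 1 and can be used wherever the method above measures a distance between two samples.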

Hence, in apparatus for carrying out this process, the data input I may be a database 
and the data output O may be an identification of the selected subset of data, for example 
on a terminal. 

Network Design 

The minimum spanning tree represents a network connecting each data sample to at 
least one other sample such that the total edge length or weight is minimized, and therefore 
represents an optimum solution to real-life problems in which nodes need to be connected 
together in a network with maximum efficiency. Such problems include circuit design, in 
which the distance between data samples represents the length of wiring needed to 
interconnect circuit nodes. Similarly, where the data samples represent communications 
nodes and the distance between them represents the inefficiency incurred by 
interconnecting them, the minimum spanning tree represents the most efficient way of 
interconnecting the nodes. The method for finding the minimum spanning tree according to 
the above method may be applied to any such real-life problems. 
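
For comparison, a minimum spanning tree of a small set of nodes can be found with Prim's algorithm, a standard textbook method which is not the accelerated method described above. The sketch assumes Euclidean distance as the edge weight; the function name `minimum_spanning_tree` is hypothetical, and the nodes are the nine samples of the specific example.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j, length) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # shortest known link from the tree to each node
    link = [None] * n          # tree node which provides that link
    best[0] = 0.0              # start from the first node
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if link[u] is not None:
            edges.append((link[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], link[v] = d, u
    return edges

points = [(0, 1), (7, 5), (3, 7), (5, 1), (2, 0), (8, 6), (7, 6), (2, 2), (9, 8)]
tree = minimum_spanning_tree(points)
print(len(tree), round(sum(length for _, _, length in tree), 2))
```

This version measures every pairwise distance, so it is useful mainly as a reference against which the total edge length produced by a faster method can be checked.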

Hence, in the apparatus for carrying out this process, the data input I may provide a 
data file representing the properties of nodes to be connected, and the data output O may 
represent a design for interconnecting the nodes. The design may be a graphic 
representation or a series of instructions to carry out the design. The series of instructions 
may be carried out automatically so as to create interconnections according to the design. 

Pattern Recognition 

Pattern recognition involves the classification of a new data sample based on its
similarity to a set of data samples which have already been classified.

Cluster Rank Function

In pattern recognition, it is useful to generate a rank function for a given cluster of 
data samples. The rank function is a function of the dimensions of the data samples and 
gives a rank value of a new data sample as a member of the given cluster. The rank value 
can be used to determine in which cluster a new data sample should be classified. 

A method will now be described for generating the rank function for a given 
cluster, given the data recorded above which defines the minimum spanning tree. The data 
which defines the minimum spanning tree need not be obtained by the clustering method 
described above, but this is preferable in view of the speed advantages. 

Preprocessing - Reorder Output in Merge Height Order 

For ease of subsequent processing, the output array is reordered in height order. In 
the specific example, the reordering is as follows: 



Index i   source(i)   dest(i)   height(i)   join(i)
5         1           2         2.24        1
6         9           8         2.24
7         4           3         3.61        1
8         5           7         4.12        5
9         4           6         4.47        1



The result may be represented as shown in Figure 7, in which the samples are 
reordered to avoid any of the merge lines crossing. 
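
The reordering step can be sketched as a stable sort on the height field. This is a minimal illustration rather than the implementation of the specification; the record layout and the function name `reorder_by_height` are assumptions, and the rows are those legible in the table above, deliberately shuffled.

```python
def reorder_by_height(records):
    """Return the edge records sorted by merge height (stable for equal heights)."""
    return sorted(records, key=lambda rec: rec["height"])


# Rows taken from the example table, in arbitrary input order.
edges = [
    {"source": 4, "dest": 6, "height": 4.47},
    {"source": 9, "dest": 8, "height": 2.24},
    {"source": 5, "dest": 7, "height": 4.12},
    {"source": 1, "dest": 2, "height": 2.24},
    {"source": 4, "dest": 3, "height": 3.61},
]
ordered = reorder_by_height(edges)
```

Because the sort is stable, edges of equal height keep their relative input order.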

Rank Function Contours 

For ease of explanation, the contours of the rank function will now be described 
with reference to Figure 8, although it is not necessary to calculate the shape of the 
contours in order to calculate the value of the rank function for a new data sample. 

First, hyperspheres (in our example, circles) having a radius equal to the smallest 
height(i) are drawn around each sample joined at the smallest height. In our example the 
hyperspheres are drawn around each of samples 6, 7 and 8. The perimeter of the 
overlapping hyperspheres is assigned a probability of (N - y(1) - 1)/(N - 1), where y(1) is the 
number of edges formed at that height. In this case, the probability is 6/8. 

We then proceed to the next smallest merge height of 2, which joins samples 2 and 
3. Around those samples, we draw hyperspheres of radius equal to the next smaller 
merging height (=1), to which the perimeter is also assigned a probability of 6/8. Next, 
around all of the samples of the current merge height or below, we draw hyperspheres of 
the current merge height, and assign to their perimeter a probability of (N - y(2) - 1)/(N - 1), 
where y(2) is the number of edges formed at the current merge height or below. In this 
case, the probability is 5/8. 

We then proceed to the next smallest merge height of 2.24, which applies to 
samples 1, 2, 8 and 9. We draw around them hyperspheres of radius equal to the next 
smaller merge height (=2), to which a probability of 5/8 is also assigned. Next, around all 
of the samples of the current merge height or below, we draw hyperspheres of the current 
merge height and assign to their perimeter a probability of (N - y(3) - 1)/(N - 1), where y(3) is 
the number of edges formed at the current merge height or below. In this case, the 
probability is 3/8. 

We then proceed to the next smallest merge height of 3.61, which applies to 
samples 3 and 4. We draw around them hyperspheres of radius equal to the next smaller 
merge height (=2.24), to which a probability of 3/8 is also assigned. Next, around all of the 
samples of the current merge height or below, we draw hyperspheres of the current merge 
height and assign to their perimeter a probability of (N - y(4) - 1)/(N - 1), where y(4) is the 
number of edges formed at the current merge height or below. In this case, the probability 
is 2/8. 

We then proceed to the next smallest merge height of 4.12, which applies to 
samples 5 and 7. We draw around them hyperspheres of radius equal to the next smaller 
merge height (=3.61), to which a probability of 2/8 is also assigned. Next, around all of the 
samples of the current merge height or below, we draw hyperspheres of the current merge 
height and assign to their perimeter a probability of (N - y(5) - 1)/(N - 1), where y(5) is the 
number of edges formed at the current merge height or below. In this case, the probability 
is 1/8. These circles are not shown in Figure 8, as there is insufficient space. 

Finally, we reach the largest merge height of 4.47, which applies to samples 4 and 
6. We draw around them hyperspheres of radius equal to the next smaller merge height 
(=4.12), to which a probability of 1/8 is also assigned. Next, around all of the samples of 
the current merge height or below, we draw hyperspheres of the current merge height and 
assign to their perimeter a probability of (N - y(6) - 1)/(N - 1), where y(6) is the number of 
edges formed at the current merge height or below. In this case, the probability is 0/8. 
These circles are not shown in Figure 8, as there is insufficient space. 
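
The perimeter values quoted in the walk-through above can be reproduced with a short sketch. It assumes the eight merge heights of the example (two edges at height 1, one at 2, two at 2.24, and one each at 3.61, 4.12 and 4.47) and N = 9 samples; the function name `perimeter_values` is hypothetical.

```python
from fractions import Fraction
from itertools import groupby


def perimeter_values(heights):
    """Map each distinct merge height to (N - y - 1)/(N - 1), where y is the
    number of edges formed at that height or below."""
    n = len(heights) + 1  # a spanning tree over N samples has N - 1 edges
    y = 0
    values = []
    for height, group in groupby(sorted(heights)):
        y += len(list(group))
        values.append((height, Fraction(n - y - 1, n - 1)))
    return values


example_heights = [1, 1, 2, 2.24, 2.24, 3.61, 4.12, 4.47]
contours = dict(perimeter_values(example_heights))
```

`Fraction` reduces 6/8 to 3/4 and 2/8 to 1/4, so the values compare equal to those in the text even though they print in lowest terms.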

To calculate the value of the rank function for a test data sample, we interpolate 
between the perimeters of the circles. Within the circles of smallest radius, we interpolate 
up to rank function = 1 at the centre if the centres are the head and tail of the smallest edge. 
Otherwise, for the samples of next smallest merge height, the rank function is constant 
within circles of the smallest radius and set to the value at the boundary. 

Rank Estimation - Spherical Case 

The calculation of the rank value by interpolation will now be described in detail 
below, in a first order case as described above in which hyperspherical boundaries are 
defined. 

Step 15 - Find Absolute Distance 

For each sample, the radial or linear 'absolute distance' from the origin is 
calculated, as in step 2 above. 

Step 16 - Sort Absolute Distances 

The set of absolute distances of the samples is sorted and indexed. 
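
Steps 15 and 16 can be sketched as follows. The sketch assumes the Euclidean norm as the 'absolute distance' and uses the triangle inequality: a sample lying within a given radius of the test sample must have an absolute distance within that radius of the test sample's own absolute distance, so a search can be limited to a slice of the sorted index. The names `build_index` and `candidates_within` are hypothetical.

```python
import bisect
import math


def norm(point):
    """Euclidean distance of a point from the origin."""
    return math.sqrt(sum(x * x for x in point))


def build_index(samples):
    """Steps 15 and 16: (absolute distance, sample number) pairs in sorted order."""
    return sorted((norm(s), i) for i, s in enumerate(samples))


def candidates_within(index, test, radius):
    """Sample numbers whose absolute distance differs from that of the test
    sample by at most `radius`; only these can lie within `radius` of it."""
    t = norm(test)
    lo = bisect.bisect_left(index, (t - radius, -1))
    hi = bisect.bisect_right(index, (t + radius, math.inf))
    return [i for _, i in index[lo:hi]]


index = build_index([(3, 4), (0, 1), (6, 8), (1, 0)])
near = candidates_within(index, (0, 5), 1.0)
```

The candidates returned still need an exact distance check; the index only narrows the range to be searched.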

Step 17 - Calculate Rank Values 

The data samples are classified into one or more clusters. For example, the binary 
tree may be 'cut' at a specified merging height so that all clusters merged above that 
merging height are considered to belong to different clusters. Alternatively, the data 
samples may have been separated a priori into groups and clustering performed 
independently on each group. 

For each cluster, as shown in outline in Figure 11: 

17a) Find the sample NSP in that cluster nearest to the test sample, using a method 
similar to step 13a) for finding the nearest neighbour of a sample, but restricting the 
method to the current cluster and the test sample. 

17b) Find the distance d from the test sample to the nearest sample NSP in the 
cluster, and the largest edge length e within the cluster. If the cluster contains only one 
sample and therefore no edges, make e = d/2. If d > e, the test sample is determined to lie 
outside the cluster altogether and the rank value R = e - d, which will be negative; the 
method then stops. 

17c) The test sample lies within the cluster. If the cluster has only a single sample, 
make R = 1 - d/e and stop. 

17d) The cluster has multiple samples. Do the following steps: 

17da) Let T be the number of edges in the minimum spanning tree of the cluster. 

17db) Let CE be the index of the current edge under consideration; the 
edges are indexed in increasing length. CE is initially set to the first edge of height greater 
than or equal to the greater of d and (r-d), where r is the merge height of the NSP. 

17dc) Let NLE be the index of the first edge longer than the edge CE. 

17dd) Let NH be the number of edges of length less than the edge CE. 

17de) Let MIN be the distance from the test sample to the nearest sample 
considered so far; MIN is set to a maximum value initially. 

17df) Let SL be the length of the longest edge in the MST shorter than the 
edge CE; SL is initially set to zero. 

17dg) Let LL be the length of the current edge CE; LL is initially set to zero. 

17dh) Let ND be the number of edges in the minimum spanning tree of the 
cluster with length equal to LL; ND is set initially to 1. 

17di) Find LL for the current edge CE. 

17dj) Find ND. Increment CE by ND so that CE is now the index of the 
shortest edge of length greater than LL. 

17dk) Define the set of 'active samples' AS as containing initially all 
samples with merge height less than or equal to SL. The 'active region' AR is defined as 
the region of all samples within a distance SL of any of the active samples AS. 

17dl) If LL is less than or equal to the greatest edge length in the minimum 
spanning tree, find the updated AR and check whether the test sample falls within it as 
follows: 

17dla) Add the samples with merge height LL to the active samples 
AS. For each of the samples with merge height LL, measure the distance TD to the test 
sample. If TD is less than MIN, set MIN as the greater of TD and SL. 

17dlb) If MIN is less than or equal to LL, go to step 17dn) to find 
the rank value, as the test sample lies within the active region. 

17dlc) Set SL = LL and let LL be the length of the new edge. Add ND 
to NH, as the number of edges of length less than LL has increased by ND. Set ND to be 
the number of edges in the minimum spanning tree of length LL. Add ND to CE so that CE 
is the index of the shortest edge of length greater than LL. 

17dm) Go back to step 17dl). 

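17dn) The rank value R is given as follows: 

\[ R = \frac{\mathrm{LL} - \mathrm{MIN}}{\mathrm{LL} - \mathrm{SL}} \times \frac{\mathrm{ND}}{T} + \frac{T - \mathrm{ND} - \mathrm{NH}}{T} \qquad (1) \]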
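
The spherical-case procedure can be sketched in simplified form as below. This is one reading of steps 17b) to 17dn), not a definitive implementation, and it makes several assumptions: the cluster has at least one edge, all edge lengths are positive, the 'samples with merge height LL' are taken to be the end points of the edges of length LL, the scan starts from the shortest edge rather than using the starting index of step 17db), and the nearest sample is found by a plain linear search. The function name `spherical_rank` is hypothetical.

```python
import math
from itertools import groupby


def spherical_rank(samples, mst_edges, test):
    """Estimate the rank value of `test` for one cluster (hyperspherical case).

    samples   -- list of coordinate tuples for the cluster
    mst_edges -- list of (i, j) index pairs forming the cluster's MST
    """
    if not mst_edges:
        raise ValueError("this sketch assumes a cluster with at least one edge")

    # Edges in increasing length; rounding groups equal-length edges
    # (an assumption of the sketch, to absorb floating-point noise).
    edges = sorted(
        (round(math.dist(samples[i], samples[j]), 9), i, j) for i, j in mst_edges
    )
    T = len(edges)
    d = min(math.dist(test, s) for s in samples)  # distance to nearest sample
    e = edges[-1][0]                              # largest edge length
    if d > e:                                     # step 17b: outside the cluster
        return e - d

    NH, SL, MIN = 0, 0.0, math.inf
    for LL, group in groupby(edges, key=lambda rec: rec[0]):
        group = list(group)
        ND = len(group)
        # Step 17dla: the samples joined at this length become active.
        for _, i, j in group:
            for k in (i, j):
                TD = math.dist(test, samples[k])
                if TD < MIN:
                    MIN = max(TD, SL)
        # Steps 17dlb and 17dn: inside the active region, so interpolate.
        if MIN <= LL:
            return (LL - MIN) / (LL - SL) * ND / T + (T - ND - NH) / T
        # Step 17dlc: move on to the next longer edge length.
        NH += ND
        SL = LL
    raise AssertionError("not reached when d <= e")


# Hypothetical three-sample cluster on a line, with edges of length 1 and 2.
cluster = [(0, 0), (1, 0), (3, 0)]
tree = [(0, 1), (1, 2)]
r_inside = spherical_rank(cluster, tree, (0.5, 0))
```

For this cluster a test sample midway along the shortest edge receives a rank of 0.75, a test sample coinciding with an end of the shortest edge receives 1, and a test sample further than the longest edge from every sample receives a negative rank.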

Rank Estimation - Ellipsoidal Case 

The calculation of the rank value by interpolation will now be described in an 
alternative second-order case in which hyperellipsoidal boundaries are defined. Instead of 
defining spheroidal boundaries from each sample as shown in Figure 8, ellipsoidal 
boundaries are defined with the samples at either end of an edge as the foci. Step 17 is 
replaced by step 17' as follows: 

17a') Find the sample NSP in that cluster nearest to the test sample, using a method 
similar to step 13a) for finding the nearest neighbour of a sample, but restricting the 
method to the current cluster and the test sample. 

17b') Find the distance d from the test sample to the nearest sample NSP in the 
cluster, and the largest edge length e within the cluster. If the cluster contains only one 
sample and therefore no edges, make e = d/2. If d > 1.5 × e, the test sample is determined to 
lie outside the cluster altogether and the rank value is assigned a value as follows: 

17ba') Using the indexed absolute distances to order and limit the search, 
find the edge in the minimum spanning tree of the cluster which has the 
smallest string distance S to the test sample. The two samples connected by 
the edge can be found from the source(i) and dest(i) arrays. If the smallest 
current S is r, then any edge with a smaller S must have at least one of its 
samples lying within an absolute distance of 1.5 × r of the test sample; 
hence, the search is limited to this range. d takes the value of the smallest S, 
and R=L-e/± The procedure then stops. 
17c') The test sample lies within the cluster. If the cluster has only a single sample, 
make R = 1 - d/e and stop. 

17d') The cluster has multiple samples. Do the following steps: 

17da') Let T be the number of edges in the minimum spanning tree of the cluster. 

17db') Let CE be the index of the current edge under consideration; the 
edges are indexed in increasing length. CE is initially set to the first edge of height greater 
than or equal to two thirds of the greater of d and (r-d), where r is the merge height of the 
NSP. 

17dc') Let NLE be the index of the first edge longer than the edge CE. 
17dd') Let NH be the number of edges of length less than the edge CE. 
17de') Let MIN be the distance from the test sample to the nearest sample 
considered so far; MIN is set to a maximum value initially. 

17df') Let SL be the length of the longest edge in the MST shorter than the edge CE. 

17dg') Let LL be the length of the current edge CE; LL is initially set to zero. 

17dh') Let ND be the number of edges in the minimum spanning tree of the 
cluster with length equal to LL; ND is set initially to 1. 

17di') Find LL for the current edge CE. 

17dj') Find ND. Increment CE by ND so that CE is now the index of the 
shortest edge of length greater than LL. 

17dk') Define the set of 'active edges' AE as having no samples initially. 
The 'active region' AR is defined as the region of all samples within a 'string distance' S 
of LL of any edge of AE. The 'string distance' is one half of the difference between the 
sum SD of the distances from the test sample to each of the samples connected by the edge 
and the length of the edge: 

S = (SD - LL)/2 



17dl') Find the updated AR and check whether the test sample falls within it 
as follows: 

17dla') Add the samples with merge height LL to the active samples 
AS. For each of the samples with merge height LL, measure the string distance S to the test 
sample. If S is less than MIN, set MIN as the greater of S and SL. 

17dlb') If MIN is less than or equal to LL, go to step 17dn') to find 
the rank value, as the test sample lies within the active region. 

17dlc') Set SL = LL and let LL be the length of the new edge. Add 
ND to NH, as the number of edges of length less than LL has increased by ND. Set ND to 
be the number of edges in the minimum spanning tree of length LL. Add ND to CE so that 
CE is the index of the shortest edge of length greater than LL. If there are no longer edges 
in the minimum spanning tree, then the test sample lies outside the cluster, so go to 17dn'). 

17dm') Go to 17dl'). 

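17dn') The estimated rank value is given by: 

\[ R = \frac{\mathrm{LL} - \mathrm{MIN}}{\mathrm{LL} - \mathrm{SL}} \times \frac{\mathrm{ND}}{T} + \frac{T - \mathrm{ND} - \mathrm{NH}}{T} \qquad (1) \]

17do') The negative estimated rank is given by R = MIN - SL 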
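
The string distance used in the ellipsoidal case can be sketched directly from its definition, S = (SD - LL)/2. The sketch assumes a Euclidean metric, and the function name `string_distance` is hypothetical.

```python
import math


def string_distance(test, a, b):
    """Half the excess of the path a -> test -> b over the edge length |ab|."""
    sd = math.dist(test, a) + math.dist(test, b)  # SD: sum of distances to both ends
    ll = math.dist(a, b)                          # length of the edge
    return (sd - ll) / 2


# Hypothetical edge between samples at (0, 0) and (4, 0).
s_on_edge = string_distance((2, 0), (0, 0), (4, 0))
s_beyond = string_distance((-1, 0), (0, 0), (4, 0))
s_beside = string_distance((2, 1.5), (0, 0), (4, 0))
```

A test sample on the edge itself has S = 0, and all points sharing a given value of S lie on an ellipse whose foci are the two samples at the ends of the edge, which is why the contours in this case are hyperellipsoidal.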



Classification 

Where there is more than one cluster, the test sample is assigned to the cluster 
having the greatest estimated rank value. 

This process has many practical applications in the general field of artificial 
intelligence involving automatic recognition of an input, such as an image or sound. For 
example, in a voice recognition application a series of training samples are input at the data 
input I during a training phase. It is known a priori what sounds or phonemes the training 
samples are intended to represent, so the samples are divided according to intended 
classification and each classification of samples is clustered independently. In recognition 
mode, a rank value is calculated for a test sample in relation to each of the clusters and the 
test sample is classified according to comparison between these rank values. A 'hard' 
classification may be made by assigning only one classification to the test sample, or a 
'soft' classification may be made by assigning a probability of the test sample belonging to 
each of a number of different classifications. The 'soft' classifications may be used to 
classify a series of sounds by determining the relative probabilities of possible sequences of 
sounds, weighted by the 'soft' classification probability of each sound. 
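
A 'hard' classification of the kind described above, assigning the test sample to the cluster with the greatest estimated rank value, can be sketched as below. The cluster labels and rank values are hypothetical and serve only to show the comparison; the function name `classify` is likewise an assumption.

```python
def classify(rank_by_cluster):
    """Return the label of the cluster with the greatest estimated rank value."""
    return max(rank_by_cluster, key=rank_by_cluster.get)


# Hypothetical rank values for one test sample against two clusters.
ranks = {"cluster A": 1 / 3, "cluster B": -0.4}
label = classify(ranks)
```

A negative rank value indicates that the test sample lies outside that cluster, so a comparison of this kind still yields a decision when the sample is inside one envelope and outside another.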

This technique has the advantage that the envelope of one cluster may partially 
overlap that of another cluster in dimension space, while still allowing a classification 
decision to be made. 

The technique may be applied to data samples each representing a spatial 
configuration, for example in the field of optical character recognition (OCR), in which the 
output is a representation of a recognised character, or robotic vision in which the output is 
one or more actions performed in response to the classification of the test sample. 

Hence, in apparatus for carrying out this process, the data input I is a sensor or 
sensor array, and the data output O is a representation of the classification of an input, 
which affects subsequent processing steps by the apparatus. 

Specific Example 

A specific example of estimating the rank value of a test sample will now be 
described with reference to Figure 9. The single cluster of Figure 8 is divided into two 
clusters by removing the longest edge between samples 4 and 6; this can be represented as 
cutting the binary tree at a height between 4.12 and 4.47. A test sample TP is given at 
coordinates (6,3) and it is desired to find the rank value of the test sample for each cluster, 
for the purpose of determining to which of the two clusters the test sample should belong. 
The rank contours are now as shown in Figure 9. By way of comparison, the rank contours 
for the ellipsoidal case are as shown in Figure 10. 

Following the hyperspheroidal case as described above, we take each cluster in 
turn. In the cluster of samples (1, 2, 3, 4), the closest sample to the test sample is sample 4, 
for which d is √5. e is √10, so TP lies within this cluster. There are 3 edges, in increasing 
length: b, a, c. CE initially points to edge a, since this has a length equal to d, and LL = √5. 
Samples 1, 2 and 3 are added to the active samples. Sample 3 is closer to TP, so MIN 
becomes √17. This is greater than LL, so we find the next longer edge, which is edge a, of 
length √5. Now we return to step 17dl) and add sample 1 to the active samples. However, 
sample 1 is further from TP than the samples already tested, so MIN is still √17, which is 
greater than LL. We therefore find the next longer edge, which is edge c. Sample 4 is added 
to the active samples, and LL = √10, while SL = √5. MIN becomes √5, which is less than 
LL; hence, we calculate: 

\[ R = \frac{\sqrt{10} - \sqrt{5}}{\sqrt{10} - \sqrt{5}} \times \frac{1}{3} + \frac{3 - 1 - 2}{3} = \frac{1}{3} \]

as can be confirmed by observing that TP lies on the R = 1/3 boundary. 
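
The arithmetic of this example can be checked numerically. The sketch below evaluates the rank formula with the values stated in the text (LL = √10, SL = √5, MIN = √5, ND = 1, NH = 2, T = 3); the function name `rank_value` is hypothetical.

```python
import math


def rank_value(LL, SL, MIN, ND, NH, T):
    """Interpolated rank: (LL - MIN)/(LL - SL) * ND/T + (T - ND - NH)/T."""
    return (LL - MIN) / (LL - SL) * ND / T + (T - ND - NH) / T


R = rank_value(LL=math.sqrt(10), SL=math.sqrt(5), MIN=math.sqrt(5), ND=1, NH=2, T=3)
R_outer = rank_value(LL=math.sqrt(10), SL=math.sqrt(5), MIN=math.sqrt(10), ND=1, NH=2, T=3)
```

With MIN equal to SL the first term takes its full weight ND/T, giving R = 1/3; with MIN equal to LL the first term vanishes, giving the value on the outermost perimeter, which is 0 here.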

Although the above embodiments have been described with reference to a 
Euclidean metric, it will be appreciated that other types of metric may alternatively be 
used. Moreover, aspects of the present invention may be applied to non-metric data 
samples, and to clustering methods other than single-link clustering. 

Specific embodiments of the present invention will now be described with 
reference to the accompanying drawings, in which: 

Figure 1 is a flowchart showing the principal steps of a method in an embodiment 
of the present invention; 

Figure 2 is a flowchart showing the detailed steps of step 13a of the flowchart of 
Figure 1; 

Figure 3 shows the minimum spanning tree of a set of data samples, calculated 
using the method shown in Figures 1 and 2; 

Figure 4 shows a binary tree at an intermediate stage of the calculation of the 
minimum spanning tree; 

Figure 5 shows the binary tree at the final stage of the calculation; 

Figure 6 is a generic diagram of apparatus for carrying out the method in a technical 
process; 

Figure 7 shows the binary tree of Figure 5 sorted in order of merge height; 

Figure 8 shows the hyperspherical contours of a rank value function for a cluster of 
the data samples of Figure 3; 

Figure 9 shows the hyperspherical contours of a rank value function for a sub- 
cluster of the data samples of Figure 3; 

Figure 10 shows the hyperellipsoidal contours of a rank value function for the sub- 
cluster of Figure 9; and 

Figure 11 is a flow diagram of the calculation of the rank value function. 

Fig. 1 

Fig. 2 

Fig. 5 (axes: Merge Height; Sample) 

Fig. 6 

Fig. 9 

Fig. 11 

A data clustering method involves techniques for improving the speed of generation of 
clustering data representing hierarchical clustering of a set of data samples. The techniques 
include the selection of clusters in increasing size for selecting the nearest other cluster for 
merging, ordering the data samples according to absolute distance from a reference and 
searching for nearest neighbours within a restricted index range, and making distance 
comparisons by summing the contributions from components in each dimension in turn in 
order of the interquartile ranges of components of the data samples in each dimension. A 
data classification method involves calculating a rank value for a test sample in relation to 
a cluster of data samples, by taking into account the dissimilarities of the data samples at 
either end of the closest edge to the data sample and/or by calculating as a function of a test 
sample dissimilarity of the test sample to the most similar data sample within the cluster, 
unless the test sample dissimilarity is less than the dissimilarity of an edge in a minimum 
spanning tree which has the greatest dissimilarity less than an edge connected to the most 
similar data sample. 

The applications of the methods include data compression, feature extraction, unmixing, 
data mining and browsing, network design and pattern recognition. 